"Advanced Graphics and Data Visualization in R" is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV-2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.
This lesson is the fourth in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.
The structure of the class is a code-along style in Jupyter notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto Jupyter Hub so students can program along with the instructor.
This week will focus on standard visualizations of expression data. While our prior topics have focused on making visualizations that are applicable broadly to many types of data, a number of this week's visualizations are used mainly on multi-dimensional data from experiments like RNAseq.
At the end of this lecture you will have covered the following topics
goseqgrey background - a package, function, code, command or directory. Backticks are also use for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
Today's datasets will focus on differential expression analysis. The basic visualizations of this data after it has already been processed by packages like DESeq2.
This is an example dataset used for building a Sankey diagram. It builds a theoretical workflow for RNAseq analysis and visualization within a manuscript. What is a Sankey diagram? We'll find out!
RNA-Seq read count data generated from SARS-CoV2 infections of AEC cells. Used to compare the timecourse of expression (pro-inflammatory) changes in samples treated with and without HSP90 inhibitors. Published in iScience doi: https://doi.org/10.1016/j.isci.2021.102151
This is an RNAseq dataset comparison for infection of multiple cell types with pathogens like SARS-CoV-2, SARS-CoV-1, MERS-CoV, RSV (respiratory syncytial virus), IAV (influenza A virus) and HPIV3 (human parainfluenze virus type 3) published in Cell doi: 10.1016/j.cell.2020.04.026
This is an RNAseq dataset comparison for infection of human alveolar adenocarcinoma (A549) cells with and without influenza A virus (IAV) published on bioRxiv doi: https://doi.org/10.1101/2020.03.24.004655
A saved file with a final GO annotation dataset we look at with our visualizations of dot plots.
repr- a package useful for altering some of the attributes of objects related to the R kernel.
tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2
viridis helps to create color-blind palettes for our data visualizations
networkD3, gghighlight, ggrepel, and UpSetR are used in building some of our visualizations including flowcharts, and upset plots.
goseq, and org.Hs.eg.db are packages that will help us perform some basic Gene Ontology (GO) annotations with our data.
# Install these packages onto JupyterHub
install.packages("networkD3")
install.packages("gghighlight")
install.packages('ComplexUpset')
install.packages('GGally')
# 220331: Current version of JupyterHub is not allowing goseq to be installed correctly
# # Some packages can be installed via Bioconductor
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("goseq")
# BiocManager::install("org.Hs.eg.db")
# Packages to help tidy our data
library(tidyverse)
library(readxl)
# Packages for the graphical analysis section
library(repr)
library(viridis)
# Lecture 04 visualization packages
library(networkD3)
library(GGally)
library(gghighlight)
library(ggrepel)
library(ComplexUpset)
# GO term analysis packages if you managed to install goseq
library(goseq)
library(org.Hs.eg.db)
Loading required package: BiasedUrn
Loading required package: geneLenDataBase
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: BiocGenerics
Warning message:
"package 'BiocGenerics' was built under R version 4.0.5"
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:dplyr':
combine, intersect, setdiff, union
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, append, as.data.frame, basename, cbind, colnames,
dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which.max, which.min
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: IRanges
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:geneLenDataBase':
unfactor
The following objects are masked from 'package:dplyr':
first, rename
The following object is masked from 'package:tidyr':
expand
The following object is masked from 'package:base':
expand.grid
Attaching package: 'IRanges'
The following objects are masked from 'package:dplyr':
collapse, desc, slice
The following object is masked from 'package:purrr':
reduce
The following object is masked from 'package:grDevices':
windows
Attaching package: 'AnnotationDbi'
The following object is masked from 'package:dplyr':
select
Until this point, the data sets we have used in lecture have been relatively simple epidemiological information. All of our observations are values like new cases or vaccinated individuals tied to dates or time periods. Whereas most of our observations were connected in some linear fashion the data we examine today will have tens of thousands of observations, each representing a different gene!
Given the complexity of our data, we will discuss ways to sort, partition, visualize and interpret it.
Many of you may be familiar with the idea of RNAseq data. It begins in the real world (or at the bench) with individuals/groups/conditions compared against a control state. Biological samples can be collected in time series or at a single endpoint. Every experimental conditions should have a matching control or baseline profile for comparison, not to mention biological replicates!
After collecting your samples you still need to prepare and sequence your libraries, check your data quality, trim reads, map them to a transcriptome, estimate/normalize the expression of genes within each library and then compare the expression between libraries/samples to determine if there are transcriptional differences! All of these steps are beyond this class BUT if you plan on working with transcriptomes, you'll eventually encounter this process. Today we'll jump the line and work directly with differential expression (DE) data. That means we'll be at the bottom right corner of the following diagram AKA, the end of the pipeline.
Image from Bioinformatics Workbook: RNA Sequencing Analysis
Sankey diagrams are named after Irish Captain Matthew Henry Phineas Riall Sankey, who first presented this type of visualization in 1898 to convey the energy efficiency of a steam engine!
With larger sets of data across multiple experiments, it can sometimes be helpful to use a Sankey diagram to explain the connection between different aspects of your samples. The key attribute of the Sankey diagram is that the width of a connection (flow) is proportional to the quantity represented. So the Sankey diagram is a flow diagram that also visually quantifies the dynamics of the relationships in your diagram. Think of it much like a stacked bar chart that can split or join with other bars at each connection.
A useful package to generate Sankey diagrams is networkD3 using the sankeyNetwork() function. To generate a Sankey diagram, you need information on two variables with a third optional one:
source: a starting point or for your flowtarget: the end point of your flowvalue: the number that will represent the width of your flow. Usually some quantity of contribution but this can also be an optional variable.We'll find a pre-made table to that describes a theoretical RNAseq workflow based in the datasets we'll be working with today. Some sections have been compressed for brevity but we'll find all of that information in ./data/Lecture04_sankey_data.csv. Let's begin by reading it in.
# Read in our Sankey diagram in ./data/Lecture04_sankey_data.csv
sankey_info.df <- read_csv("./data/Lecture04_sankey_data.csv")
# Check the structure
str(sankey_info.df)
# Look a few rows
head(sankey_info.df)
Rows: 24 Columns: 3 -- Column specification -------------------------------------------------------- Delimiter: "," chr (2): source, target dbl (1): value i Use `spec()` to retrieve the full column specification for this data. i Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec_tbl_df [24 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ source: chr [1:24] "CoV" "CoV" "IVS" "RSV" ... $ target: chr [1:24] "A549" "NHBE" "A549" "A549" ... $ value : num [1:24] 0.25 0.25 0.25 0.25 0.7 0.2 0.4 0.4 0.15 0.15 ... - attr(*, "spec")= .. cols( .. source = col_character(), .. target = col_character(), .. value = col_double() .. ) - attr(*, "problems")=<externalptr>
| source | target | value |
|---|---|---|
| <chr> | <chr> | <dbl> |
| CoV | A549 | 0.25 |
| CoV | NHBE | 0.25 |
| IVS | A549 | 0.25 |
| RSV | A549 | 0.25 |
| A549 | RNAseq | 0.70 |
| NHBE | RNAseq | 0.20 |
Although our data frame sankey_info.df contains the information about our connections, the networkD3 package requires some mapping of the nodes within our data. Each unique source and target will be counted as a node and we'll save that information into sankey_nodes.df.
# Combine our source and target lists, then take the unique set and put that into the "name" vector.
sankey_nodes.df <- data.frame(name=c(sankey_info.df$source, # Location of our source nodes
sankey_info.df$target) %>% # Location of our target nodes
unique() # Generate the unique combination of the two sets
)
# Look at the structure, it's just a list of the nodes
str(sankey_nodes.df)
'data.frame': 18 obs. of 1 variable: $ name: chr "CoV" "IVS" "RSV" "A549" ...
match()¶Now that we have our nodes information stored basically as an identification number, we have to map it back to sankey_info.df. This is essentially like working with a factor that spans two columns instead of just one. To accomplish the mapping, we'll use the match() method. The match() function takes two values:
x: the values to be matchedtable: the values to be matched againstand returns a vector of length x where each integer value represents the position of its matched value from table. Unmatched values from x (ie not found in table) are set to nomatch.
Let's try with a quick example with our data
# Which nodes match from our source nodes to our collected nodes (18 total nodes)?
match(sankey_info.df$source, sankey_nodes.df$name)
Note that our node information must be 0-indexed! R uses an indexing scheme that begins at 1 so we'll need to offset that information that we generate as well.
# Add our sankey_nodes.df as index information that maps between it and `sankey_info.df`
sankey_info.df <-
sankey_info.df %>% mutate(IDsource = match(sankey_info.df$source, sankey_nodes.df$name) - 1,
IDtarget = match(sankey_info.df$target, sankey_nodes.df$name) - 1) # nodes must be 0-indexed
# take a peek at what we've wrought
head(sankey_info.df)
| source | target | value | IDsource | IDtarget |
|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> | <dbl> |
| CoV | A549 | 0.25 | 0 | 3 |
| CoV | NHBE | 0.25 | 0 | 4 |
| IVS | A549 | 0.25 | 1 | 3 |
| RSV | A549 | 0.25 | 2 | 3 |
| A549 | RNAseq | 0.70 | 3 | 5 |
| NHBE | RNAseq | 0.20 | 4 | 5 |
sankeyNetwork()¶Now that we have our two necessary data frames, we can build our Sankey diagram with the sankeyNetwork() function. This function has a few parameters that we'll want to set:
Links: the location of our source/target relationship information (sankey_info.df).Nodes: the listing of nodes in our diagram (sankey_nodes.df).Source: the name of the source column (IDsource).Target: the name of our target column (IDtarget).Value: the name of our "optional" value to build the width of flows (value).NodeID: the names of our nodes (sankey_nodes.df$name).There are an additional number of parameters that allow you to play with the visualization of this diagram including:
colourScale: categorical colour of your node. Works together with...NodeGroup: for colouring your nodes based on an extra column in Nodes.LinkGroup: for colouring your links based on an extra column in Links.nodeWidth: the numeric width of each node.For more information on these parameters and others, you can go here
# Build the Sankey diagram with all of our information
sankeyNetwork(Links = sankey_info.df, # Where is the "edge" information
Nodes = sankey_nodes.df, # Where is the "node" information
Source = "IDsource", # Where are the source node labels
Target = "IDtarget", # Where are the target node labels
Value = "value", # What value do we assign to the relationships/edges
NodeID = "name", # Where are the labels for the nodes stored?
sinksRight = TRUE, # Push the nodes to the right
fontSize = 10)
Links is a tbl_df. Converting to a plain data frame.
Stepping back one step in our flowchart, just prior to your DE analysis, you'll have a set of read counts generated from RNA-Seq data. The counts are an estimation of each transcript within your data. Some programs may include the analysis of spliced isoforms and others may not.
Often in your RNA-Seq datasets, you should have multiple replicates of your data. A quick way to assess the quality of your datasets is to generate a scatterplot matrix. The scatterplot matrix is an all-by-all assessment of your experiments that generates a scatterplot of read counts for each pair-wise dataset.
| A scatterplot matrix can help to quickly assess the quality and consistency of your data. Figure taken from: Visualization methods for differential expression analysis. Rutter et al., 2019. BMC Bioinformatics 20: 458 |
We'll being by importing our data from Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv. This tab-separated data file contains total RNA experiments from the infection of airway epithelial cells (AECs) under different conditions, treatments with inhibitors, and timepoints.
# Read in your read_count data
wyler_readcounts.df <- read_tsv("./data/Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv")
str(wyler_readcounts.df, give.attr = FALSE)
Rows: 27011 Columns: 38 -- Column specification -------------------------------------------------------- Delimiter: "\t" chr (1): gene dbl (37): width, AECII_01_24h_DMSO-1, AECII_02_24h_DMSO-2, AECII_03_24h_DMSO... i Use `spec()` to retrieve the full column specification for this data. i Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec_tbl_df [27,011 x 38] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ gene : chr [1:27011] "A1BG" "A1BG-AS1" "A1CF" "A2M" ... $ width : num [1:27011] 1766 2130 9529 4653 2187 ... $ AECII_01_24h_DMSO-1 : num [1:27011] 22 55 0 0 0 ... $ AECII_02_24h_DMSO-2 : num [1:27011] 19 20 0 1 0 ... $ AECII_03_24h_DMSO-3 : num [1:27011] 15 38 0 0 2 ... $ AECII_04_24h_200nM17AAG-1 : num [1:27011] 29 11 0 0 2 4740 0 0 882 0 ... $ AECII_05_24h_200nM17AAG-2 : num [1:27011] 27 15 0 0 0 ... $ AECII_06_24h_200nM17AAG-3 : num [1:27011] 30 17 0 0 2 4220 0 0 820 0 ... $ AECII_07_24h_S2_DMSO-1 : num [1:27011] 24 41 0 0 0 ... $ AECII_08_24h_S2_DMSO-2 : num [1:27011] 18 46 0 0 2 ... $ AECII_09_24h_S2_DMSO-3 : num [1:27011] 22 42 0 0 4 2750 0 0 739 0 ... $ AECII_10_24h_S2_200nM17AAG-1: num [1:27011] 23 13 0 1 3 ... $ AECII_11_24h_S2_200nM17AAG-2: num [1:27011] 25 24 0 0 1 ... $ AECII_12_24h_S2_200nM17AAG-3: num [1:27011] 18 28 1 0 1 ... $ AECII_13_48h_DMSO-1 : num [1:27011] 24 57 0 0 0 ... $ AECII_14_48h_DMSO-2 : num [1:27011] 38 56 0 0 0 1930 0 0 892 0 ... $ AECII_15_48h_DMSO-3 : num [1:27011] 33 64 0 0 0 ... $ AECII_16_48h_200nM17AAG-1 : num [1:27011] 29 35 0 0 1 ... $ AECII_17_48h_200nM17AAG-2 : num [1:27011] 20 24 0 0 0 ... $ AECII_18_48h_200nM17AAG-3 : num [1:27011] 23 28 0 0 0 ... $ AECII_19_48h_S2_DMSO-1 : num [1:27011] 13 45 0 0 0 ... $ AECII_20_48h_S2_DMSO-2 : num [1:27011] 22 36 0 0 0 ... $ AECII_21_48h_S2_DMSO-3 : num [1:27011] 21 30 0 0 1 ... $ AECII_22_48h_S2_200nM17AAG-1: num [1:27011] 31 29 0 0 1 ... $ AECII_23_48h_S2_200nM17AAG-2: num [1:27011] 36 23 0 0 1 ... $ AECII_24_48h_S2_200nM17AAG-3: num [1:27011] 20 19 1 0 0 ... $ AECII_25_72h_DMSO-1 : num [1:27011] 35 43 0 1 2 ... $ AECII_26_72h_DMSO-2 : num [1:27011] 26 38 0 0 1 997 0 0 753 0 ... $ AECII_27_72h_DMSO-3 : num [1:27011] 29 53 0 0 4 1230 0 0 953 0 ... $ AECII_28_72h_200nM17AAG-1 : num [1:27011] 32 57 0 0 2 ... $ AECII_29_72h_200nM17AAG-2 : num [1:27011] 41 49 0 0 2 1050 0 0 553 0 ... $ AECII_30_72h_200nM17AAG-3 : num [1:27011] 33 37 0 0 3 857 0 0 458 0 ... $ AECII_31_72h_S2_DMSO-1 : num [1:27011] 24 58 0 0 1 ... $ AECII_32_72h_S2_DMSO-2 : num [1:27011] 33 63 0 0 3 ... $ AECII_33_72h_S2_DMSO-3 : num [1:27011] 23 47 1 1 2 ... $ AECII_34_72h_S2_200nM17AAG-1: num [1:27011] 22 39 0 1 1 ... $ AECII_35_72h_S2_200nM17AAG-2: num [1:27011] 28 26 0 0 2 866 0 0 481 0 ... $ AECII_36_72h_S2_200nM17AAG-3: num [1:27011] 43 34 0 0 3 ...
Before we proceed to generating our scatterplot matrix, we need to select our data from our main dataset. In particular, the data from Wyler et al., that we imported as wyler_readcounts.df contains data from infecting human airway epithelial cells (AECs) and treating with either DMSO or 200nM of the HSP90 inhibitor, 17-AAG. With each condition there is a control set of uninfected AECs. For each of these 4 conditions, there are 3 replicates and 3 timepoints (24-, 48-, and 72-hours). In total that amounts to 36 sets of data.
For our purposes, we'll:
if_all() function.AECII_xx_)Note that the if_all() function is a rather new addition to the tidyverse and is used to help filter on a predicate across multiple columns. This is very much like the across() helper function we've used previously for summarizing data. Like the across() function, the if_all() helper function uses the following parameters:
.cols: the columns you wish to apply your filtering predicate upon..fns: the function or predicate you are filtering with. You'll notice our use of ~ again to anonymize the function.readcounts_24hv72h <-
wyler_readcounts.df %>%
# Select for just the columns withs SARS-CoV-2 infection and DMSO at 24 and 72h
dplyr::select(c(9:11, 33:35)) %>%
# Restrict our data analysis to just readcounts above 10 and below 5000
filter(if_all(.cols = everything(), .fns = ~ .x >10 & .x < 5000)) %>%
# Rename the columns be removing the first portion: AECII_xx
rename_with(., ~ str_replace(string = .x,
pattern = r"(\w*_\d*_)",
replace = ""))
# Take a peek at our resulting tibble
str(readcounts_24hv72h)
tibble [12,430 x 6] (S3: tbl_df/tbl/data.frame) $ 24h_S2_DMSO-1: num [1:12430] 24 41 2534 712 204 ... $ 24h_S2_DMSO-2: num [1:12430] 18 46 2484 654 205 ... $ 24h_S2_DMSO-3: num [1:12430] 22 42 2750 739 225 ... $ 72h_S2_DMSO-1: num [1:12430] 24 58 958 1065 147 ... $ 72h_S2_DMSO-2: num [1:12430] 33 63 1163 1081 156 ... $ 72h_S2_DMSO-3: num [1:12430] 23 47 835 1022 124 ...
ggpairs() to generate a scatterplot matrix¶You may recall our working with the parallel coordinate plot from lecture 2. Rather than transform the data ourselves, we passed our data along to the ggparcoord() function from the GGally package. Well that package is here to simplify our lives again! Rather than generating the paired data between each experiment ourselves, we can pass along the read count matrix we just generated to the ggpairs() function. Parameters of the ggpairs() function that we'll use include:
data: the dataset containing our data. It can include both numerical and categorical data.mapping: the aesthetic mapping (aside from x and y) used for altering your format.columns: which columns from data are used to generate the plot. Defaults to all columns.title, xlab, ylab: the titles of your various graph components. upper and lower: named lists that may contain the variables "continuous", "combo", "discrete", and "na" used to determine how pairwise combinations of continuous and/or categorical data are plotted across the grid. You basically need to choose how these data combinations will be plotted. diag: a named list that may only contain the variables "continuous", "discrete", and "na". Each of which is set to a specific kind of plot type (ie "densityDiag", "barDiag", or "blankDiag").Moreover, you may recall that GGally works with the ggplot or is ggplot-extensible so we can use many of the same features from ggplot to fix some of the aesthetics of the result object. In the end, you'll notice that this is basically a faceted scatterplot but the dataframe required to generate it is more complex than what we currently have. For more information on this function, you can check out the GGally manual or check our the list of great examples provided on the authors' github.
# Adjust our plot window size according to the expected output
options(repr.plot.width=14, repr.plot.height=14)
# Create your matrix scatterplot with ggpairs
ggpairs(readcounts_24hv72h,
# Set the alpha of our points so we can see them better
mapping = aes(alpha = 0.4),
# We'll play with the correlation text so the values are little larger than default for us
upper = list(continuous = wrap("cor", size = 9))
) + # Set the alpha of our points so we can see them better
# Change some of the theme aspects like text size
theme(text = element_text(size = 20),
# Rotate the x-axis tick text to be
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
As you can see, the scatterplot matrix can provide a quick way to check that your replicates are similar, while your comparison conditions should show more variation between them. Looking at our above plot, the triangular sections of scatterplots in the top-left and lower-right represent the comparison of replicates in our experiment. These should produce scatter points that are close to the diagonal - meaning the replicates are quite similar in the data produced.
In the lower-left corner of the plot we find the comparison between our two conditions (24- vs 72-hour data) and you can see much more scatter away from the diagonal. If our conditions really do induce different transcriptional profiles, this is what we should expected to see.
Across the diagonal we see the density plot of our read counts. These aren't particularly helpful with this dataset as the data predominantly appears to have low read counts even after we filter for a minimum of 10 reads per gene.
This technique is also a useful way to identify potentially mislabeled samples before processing your data for analysis. Changes from the above pattern can signal the mislabeling of samples or changes to the process of sample preparation.
Digging down into your DE results, with the human genome, you are looking to identify trends in gene expression differences across 43,000 potential transcripts sequences - only half of which produce proteins. You can imagine that opening up a tabular file of that size could take a bit of time.
For each DE experiment there are a number of values generated. In a popular package like DESeq2 you will typically find:
| Variable | Description |
|---|---|
| Gene name | The names of your genes which may be in different formats like Entrez ID, Ensembl, or gene symbol |
| Base mean | Average of the normalized counts values, accounting for size factors, taken over all samples (and or replicates) |
| Log$_{2}$ fold-change | The effect size estimate - how much your gene's expression has changed from control/base conditions |
| Log fold-change SE | An estimate of effect size uncertainty |
| p-value | Hypothesis test against H$_{0}$ that no difference exists between sample groups |
| adjusted p-value | An adjusted p-value after taking into account multiple testing between groups/samples |
You can find more information on working with the DESeq2 package here
Once we have a set of data we can examine it at from many aspects:
Let's explore these different aspects and the kind of plots we can generate to accomodate the size of our data set. First, however, we need to find a useful data set to work with.
readxl library to open your .xlsx files¶We've briefly touched on using this package in Lecture 01 to help read our Microsoft Excel-based data. Today's data set from Blanc-Melo et al. (Cell 2020), comes to us in the form of an excel file. Let's use some of the tools from this package to help open up our data file. We'll start by peeking at how many sheets there are using excel_sheets().
# What are the sheets to open in our data set?
excel_sheets("./data/Blanco-Melo2020Cell.Supp1.xlsx")
read_excel()¶Now that we can see there are two sheets in our data, we can assign each one to a different data frame using the read_excel() command which contains the parameter sheet. We can use this to determine which sheet to load either based on it's position (integer) or it's name as specified by excel_sheets().
# Assign our legend
blanco_legend.df <- read_excel("./data/Blanco-Melo2020Cell.Supp1.xlsx", sheet=1)
# Assign our data
blanco_data.df <- read_excel("./data/Blanco-Melo2020Cell.Supp1.xlsx", sheet=2)
str(blanco_legend.df)
str(blanco_data.df)
tibble [20 x 3] (S3: tbl_df/tbl/data.frame) $ Field : chr [1:20] "SARS-CoV-2(A549)_L2FC" "SARS-CoV-1_L2FC" "MERS-CoV_L2FC" "RSV_L2FC" ... $ Explanation: chr [1:20] "Fold expresion change Infected vs Control (logarithmic scale to base 2)" "Fold expresion change Infected vs Control (logarithmic scale to base 2)" "Fold expresion change Infected vs Control (logarithmic scale to base 2)" "Fold expresion change Infected vs Control (logarithmic scale to base 2)" ... $ Notes : chr [1:20] "SARS-CoV-2 infection (A549 cells, MOI: 2, 24hpi)" "SARS-CoV-1 infection (MRC5 cells, MOI: 3, 24hpi Data from: GSE56192)" "MERS-CoV infection (MRC5 cells, MOI: 3, 24hpi Data from: GSE56192)" "RSV (A549 cells, MOI: 2, 24hpi)" ... tibble [23,710 x 21] (S3: tbl_df/tbl/data.frame) $ GeneName : chr [1:23710] "IFNL1" "IFNL2" "IFNL3" "IFNB1" ... $ SARS-CoV-1_L2FC : num [1:23710] 0 0 0 0 0 ... $ IAV_L2FC : num [1:23710] 1.37 1.379 1.815 0.468 0.24 ... $ MERS-CoV_L2FC : num [1:23710] 0 0.125 0 0 0.125 ... $ HPIV3_L2FC : num [1:23710] 7.84 6.84 6.23 6.44 2.62 ... $ RSV_L2FC : num [1:23710] 7.14 4.84 4.83 4.86 0 ... $ SARS-CoV-2(Calu-3)_L2FC : num [1:23710] 7.24 7.88 7.47 8.7 3.33 ... $ SARS-CoV-2(A549)_L2FC : num [1:23710] 0 0 0 -0.127 0 ... $ SARS-CoV-2(A549-ACE2)LowMOI_L2FC : num [1:23710] 0 0 0 0.388 0 ... $ SARS-CoV-2(A549-ACE2)HiMOI_L2FC : num [1:23710] 5.08 5.25 5.05 5.76 1.54 ... $ SARS-CoV-2(A549-ACE2)-Ruxolitinib_L2FC: num [1:23710] 3.548 1.947 2.511 3.92 0.297 ... $ padj_SARS-CoV-1 : num [1:23710] 1 1 1 1 1 ... $ padj_IAV : num [1:23710] 1.00 1.00 1.00 1.00 1.00 1.11e-06 1.00 1.00 1.00 1.00 ... $ padj_MERS-CoV : num [1:23710] 1 1 1 1 1 ... $ padj_HPIV3 : num [1:23710] 2.62e-68 1.48e-48 3.66e-35 2.13e-42 1.00 ... $ padj_RSV : num [1:23710] 4.51e-44 9.74e-17 3.19e-16 4.92e-17 1.00 ... $ padj_SARS-CoV-2(Calu-3) : num [1:23710] 2.53e-139 5.69e-107 5.51e-102 0.00 4.20e-11 ... $ padj_SARS-CoV-2(A549) : num [1:23710] 1 1 1 1 1 ... $ padj_SARS-CoV-2(A549-ACE2)LowMOI : num [1:23710] 0 0 0 0 0 ... $ padj_SARS-CoV-2(A549-ACE2)HiMOI : num [1:23710] 1.00 2.69e-20 2.37e-18 1.15e-28 1.00 ... $ padj_SARS-CoV-2(A549-ACE2)-Ruxolitinib: num [1:23710] 1.00 1.00 1.00 1.54e-12 1.00 ...
Yes, it looks like the legend has some helpful information about the dataset itself found in the second sheet. The DE data looks like it's split across 20 columns encompassing 10 experimental data sets. For each data set, it looks like there is a Log$_{2}$ fold-change value (L2FC) and an adjusted p-value (padj).
Let's take a closer look at blanco_legend.df first. From it we can parse out some important information about the experiments themselves like the pathogen and host in each experimental set.
# Look at the entire legend
blanco_legend.df
| Field | Explanation | Notes |
|---|---|---|
| <chr> | <chr> | <chr> |
| SARS-CoV-2(A549)_L2FC | Fold expresion change Infected vs Control (logarithmic scale to base 2) | SARS-CoV-2 infection (A549 cells, MOI: 2, 24hpi) |
| SARS-CoV-1_L2FC | Fold expresion change Infected vs Control (logarithmic scale to base 2) | SARS-CoV-1 infection (MRC5 cells, MOI: 3, 24hpi Data from: GSE56192) |
| MERS-CoV_L2FC | Fold expresion change Infected vs Control (logarithmic scale to base 2) | MERS-CoV infection (MRC5 cells, MOI: 3, 24hpi Data from: GSE56192) |
| RSV_L2FC | Fold expresion change Infected vs Control (logarithmic scale to base 2) | RSV (A549 cells, MOI: 2, 24hpi) |
| HPIV3_L2FC | Fold expresion change Infected vs Control (logarithmic scale to base 2) | HPIV3 (A549 cells, MOI: 2, 24hpi) |
| IAV_L2FC | Fold expresion change Infected vs Control (logarithmic scale to base 2) | IAV (A549 cells, MOI: 5, 9hpi) |
| SARS-CoV-2(Calu-3)_L2FC | Fold expresion change Infected vs Control (logarithmic scale to base 2) | SARS-CoV-2 infection (Calu-3cells, MOI: 2, 24hpi) |
| SARS-CoV-2(A549-ACE2)LowMOI_L2FC | Fold expresion change Infected vs Control (logarithmic scale to base 2) | SARS-CoV-2 infection (A549-ACE2 cells, MOI: 0.2, 24hpi) |
| SARS-CoV-2(A549-ACE2)HiMOI_L2FC | Fold expresion change Infected vs Control (logarithmic scale to base 2) | SARS-CoV-2 infection (A549-ACE2 cells, MOI: 2, 24hpi) |
| SARS-CoV-2(A549-ACE2)-Ruxolitinib_L2FC | Fold expresion change Infected vs Control (logarithmic scale to base 2) | SARS-CoV-2 infection with Ruxolitinib (A549-ACE2 cells, MOI: 2, 24hpi) |
| padj_SARS-CoV-2(A549) | Benjamini-Hochberg adjusted p-values (Multiple comparisons adjustement) | SARS-CoV-2 infection (A549 cells, MOI: 2, 24hpi) |
| padj_SARS-CoV-1 | Benjamini-Hochberg adjusted p-values (Multiple comparisons adjustement) | SARS-CoV-1 infection (MRC5 cells, MOI: 3, 24hpi Data from: GSE56192) |
| padj_MERS-CoV | Benjamini-Hochberg adjusted p-values (Multiple comparisons adjustement) | MERS-CoV infection (MRC5 cells, MOI: 3, 24hpi Data from: GSE56192) |
| padj_RSV | Benjamini-Hochberg adjusted p-values (Multiple comparisons adjustement) | RSV (A549 cells, MOI: 2, 24hpi) |
| padj_HPIV3 | Benjamini-Hochberg adjusted p-values (Multiple comparisons adjustement) | HPIV3 (A549 cells, MOI: 2, 24hpi) |
| padj_IAV | Benjamini-Hochberg adjusted p-values (Multiple comparisons adjustement) | IAV (A549 cells, MOI: 5, 9hpi) |
| padj_SARS-CoV-2(Calu-3) | Benjamini-Hochberg adjusted p-values (Multiple comparisons adjustement) | SARS-CoV-2 infection (Calu-3cells, MOI: 2, 24hpi) |
| padj_SARS-CoV-2(A549-ACE2)LowMOI | Benjamini-Hochberg adjusted p-values (Multiple comparisons adjustement) | SARS-CoV-2 infection (A549-ACE2 cells, MOI: 0.2, 24hpi) |
| padj_SARS-CoV-2(A549-ACE2)HiMOI | Benjamini-Hochberg adjusted p-values (Multiple comparisons adjustement) | SARS-CoV-2 infection (A549-ACE2 cells, MOI: 2, 24hpi) |
| padj_SARS-CoV-2(A549-ACE2)-Ruxolitinib | Benjamini-Hochberg adjusted p-values (Multiple comparisons adjustement) | SARS-CoV-2 infection with Ruxolitinib (A549-ACE2 cells, MOI: 2, 24hpi) |
str_match_all()¶You may remember that we can easily match for patterns in our string using functions from the stringr package but with a function like str_match_all() you can also include matching groups to your pattern that can be captured. For each string match, this function will return a matrix that contains the complete matching pattern and any grouped patterns that are denoted by the () parentheses.
If a string contains multiple matches to your pattern, it will generate additional rows in the matrix for that string.
Looking at our Notes section of blanco_legend.df, it looks like it follows a regular pattern were we can extract the pathogen and host cell information
| section | Examples |
|---|---|
| Pathogen name | SARS-CoV-2, RSV |
| Intermediate strings | infection, infection with Ruxolitinib |
| Host cell type | A549, A549-ACE2 |
# pattern match for pathogen and host information and save it into a datafram you can access later.
exp_info <-
blanco_legend.df %>%
# Use dplyr::select because the select function has been overrided by another library
dplyr::select(Notes) %>%
# Use str_match_all to grab the pattern we want, piece by piece
# example string: SARS-CoV-2 infection (A549-ACE2 cells, MOI: 2, 24hpi)
str_match_all(pattern = r"(([\w-]+)[\s\w]+\(([\w-]+)\s*cells)") %>%
# Set into a data frame
as.data.frame() %>%
# Change the column names (would be X1, X2, X3 otherwise)
magrittr::set_colnames(c("notes", "pathogen", "host"))
# Take a look at the resulting data
exp_info
Warning message in stri_match_all_regex(string, pattern, omit_no_match = TRUE, opts_regex = opts(pattern)): "argument is not an atomic vector; coercing"
| notes | pathogen | host |
|---|---|---|
| <chr> | <chr> | <chr> |
| SARS-CoV-2 infection (A549 cells | SARS-CoV-2 | A549 |
| SARS-CoV-1 infection (MRC5 cells | SARS-CoV-1 | MRC5 |
| MERS-CoV infection (MRC5 cells | MERS-CoV | MRC5 |
| RSV (A549 cells | RSV | A549 |
| HPIV3 (A549 cells | HPIV3 | A549 |
| IAV (A549 cells | IAV | A549 |
| SARS-CoV-2 infection (Calu-3cells | SARS-CoV-2 | Calu-3 |
| SARS-CoV-2 infection (A549-ACE2 cells | SARS-CoV-2 | A549-ACE2 |
| SARS-CoV-2 infection (A549-ACE2 cells | SARS-CoV-2 | A549-ACE2 |
| SARS-CoV-2 infection with Ruxolitinib (A549-ACE2 cells | SARS-CoV-2 | A549-ACE2 |
| SARS-CoV-2 infection (A549 cells | SARS-CoV-2 | A549 |
| SARS-CoV-1 infection (MRC5 cells | SARS-CoV-1 | MRC5 |
| MERS-CoV infection (MRC5 cells | MERS-CoV | MRC5 |
| RSV (A549 cells | RSV | A549 |
| HPIV3 (A549 cells | HPIV3 | A549 |
| IAV (A549 cells | IAV | A549 |
| SARS-CoV-2 infection (Calu-3cells | SARS-CoV-2 | Calu-3 |
| SARS-CoV-2 infection (A549-ACE2 cells | SARS-CoV-2 | A549-ACE2 |
| SARS-CoV-2 infection (A549-ACE2 cells | SARS-CoV-2 | A549-ACE2 |
| SARS-CoV-2 infection with Ruxolitinib (A549-ACE2 cells | SARS-CoV-2 | A549-ACE2 |
# Add this new information to our original legend information
blanco_exp_info.df <-
blanco_legend.df %>%
# Make some new columns with pathogen and host information
mutate(pathogen = exp_info$pathogen, # Our pathogen information
host = exp_info$host) %>% # Our host information
# Combine our pathogen and host information into a single column as well
unite(col = exp_info, pathogen:host, sep="_", remove = FALSE) %>%
# Just keep the original Field column and the new ones
dplyr::select(1, 4, 5, 6)
# Take a look at what we are working with
blanco_exp_info.df
| Field | exp_info | pathogen | host |
|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> |
| SARS-CoV-2(A549)_L2FC | SARS-CoV-2_A549 | SARS-CoV-2 | A549 |
| SARS-CoV-1_L2FC | SARS-CoV-1_MRC5 | SARS-CoV-1 | MRC5 |
| MERS-CoV_L2FC | MERS-CoV_MRC5 | MERS-CoV | MRC5 |
| RSV_L2FC | RSV_A549 | RSV | A549 |
| HPIV3_L2FC | HPIV3_A549 | HPIV3 | A549 |
| IAV_L2FC | IAV_A549 | IAV | A549 |
| SARS-CoV-2(Calu-3)_L2FC | SARS-CoV-2_Calu-3 | SARS-CoV-2 | Calu-3 |
| SARS-CoV-2(A549-ACE2)LowMOI_L2FC | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 |
| SARS-CoV-2(A549-ACE2)HiMOI_L2FC | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 |
| SARS-CoV-2(A549-ACE2)-Ruxolitinib_L2FC | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 |
| padj_SARS-CoV-2(A549) | SARS-CoV-2_A549 | SARS-CoV-2 | A549 |
| padj_SARS-CoV-1 | SARS-CoV-1_MRC5 | SARS-CoV-1 | MRC5 |
| padj_MERS-CoV | MERS-CoV_MRC5 | MERS-CoV | MRC5 |
| padj_RSV | RSV_A549 | RSV | A549 |
| padj_HPIV3 | HPIV3_A549 | HPIV3 | A549 |
| padj_IAV | IAV_A549 | IAV | A549 |
| padj_SARS-CoV-2(Calu-3) | SARS-CoV-2_Calu-3 | SARS-CoV-2 | Calu-3 |
| padj_SARS-CoV-2(A549-ACE2)LowMOI | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 |
| padj_SARS-CoV-2(A549-ACE2)HiMOI | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 |
| padj_SARS-CoV-2(A549-ACE2)-Ruxolitinib | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 |
Now that we have catalogued our experimental information, let's build that into a long-format version of our actual DE dataset found in blanco_data.df. The steps we'll take to prepare our data are
pivot_longer().blanco_exp_info.dfexperiment_value so we can split it out properly.experiment_value information. Currently we'll have experiments mapping to both an L2FC and padj valuepivot_wider()blanco_data_long.df <-
blanco_data.df %>%
# collapse all of the data columns
pivot_longer(cols=(2:21), names_to = "experiment", values_to = "value") %>%
# add our experimental info to each observation. Note that each gene has multiple obs now.
full_join(., blanco_exp_info.df, by=c("experiment" = "Field")) %>%
# We need to fix all of the "padj_experiment" values to "experiment_padj" format
# Note we could have renamed our columns BEFORE pivoting
mutate(experiment = str_replace_all(.$experiment,
pattern=r"((padj)_([\w-\(\)]*))", # Capture the groups
replacement = r"(\2_\1)")) %>% # Put them back together in a switched order
# Now we want to split the experiment column into two: experiment and data_type (L2FC or padj)
separate(experiment, c("experiment", "data_type"), sep="_", remove=TRUE) %>%
# Pivot out the data so that each observation for each experiment has an L2FC and padj value
pivot_wider(names_from = data_type, values_from = value)
blanco_data_long.df
| GeneName | experiment | exp_info | pathogen | host | L2FC | padj |
|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
| IFNL1 | SARS-CoV-1 | SARS-CoV-1_MRC5 | SARS-CoV-1 | MRC5 | 0.000000 | 1.00e+00 |
| IFNL1 | IAV | IAV_A549 | IAV | A549 | 1.369700 | 1.00e+00 |
| IFNL1 | MERS-CoV | MERS-CoV_MRC5 | MERS-CoV | MRC5 | 0.000000 | 1.00e+00 |
| IFNL1 | HPIV3 | HPIV3_A549 | HPIV3 | A549 | 7.837988 | 2.62e-68 |
| IFNL1 | RSV | RSV_A549 | RSV | A549 | 7.142762 | 4.51e-44 |
| IFNL1 | SARS-CoV-2(Calu-3) | SARS-CoV-2_Calu-3 | SARS-CoV-2 | Calu-3 | 7.239900 | 2.53e-139 |
| IFNL1 | SARS-CoV-2(A549) | SARS-CoV-2_A549 | SARS-CoV-2 | A549 | 0.000000 | 1.00e+00 |
| IFNL1 | SARS-CoV-2(A549-ACE2)LowMOI | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 0.000000 | 0.00e+00 |
| IFNL1 | SARS-CoV-2(A549-ACE2)HiMOI | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 5.077692 | 1.00e+00 |
| IFNL1 | SARS-CoV-2(A549-ACE2)-Ruxolitinib | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 3.547628 | 1.00e+00 |
| IFNL2 | SARS-CoV-1 | SARS-CoV-1_MRC5 | SARS-CoV-1 | MRC5 | 0.000000 | 1.00e+00 |
| IFNL2 | IAV | IAV_A549 | IAV | A549 | 1.379280 | 1.00e+00 |
| IFNL2 | MERS-CoV | MERS-CoV_MRC5 | MERS-CoV | MRC5 | 0.125310 | 1.00e+00 |
| IFNL2 | HPIV3 | HPIV3_A549 | HPIV3 | A549 | 6.843922 | 1.48e-48 |
| IFNL2 | RSV | RSV_A549 | RSV | A549 | 4.838329 | 9.74e-17 |
| IFNL2 | SARS-CoV-2(Calu-3) | SARS-CoV-2_Calu-3 | SARS-CoV-2 | Calu-3 | 7.877920 | 5.69e-107 |
| IFNL2 | SARS-CoV-2(A549) | SARS-CoV-2_A549 | SARS-CoV-2 | A549 | 0.000000 | 1.00e+00 |
| IFNL2 | SARS-CoV-2(A549-ACE2)LowMOI | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 0.000000 | 0.00e+00 |
| IFNL2 | SARS-CoV-2(A549-ACE2)HiMOI | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 5.246026 | 2.69e-20 |
| IFNL2 | SARS-CoV-2(A549-ACE2)-Ruxolitinib | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 1.947436 | 1.00e+00 |
| IFNL3 | SARS-CoV-1 | SARS-CoV-1_MRC5 | SARS-CoV-1 | MRC5 | 0.000000 | 1.00e+00 |
| IFNL3 | IAV | IAV_A549 | IAV | A549 | 1.815420 | 1.00e+00 |
| IFNL3 | MERS-CoV | MERS-CoV_MRC5 | MERS-CoV | MRC5 | 0.000000 | 1.00e+00 |
| IFNL3 | HPIV3 | HPIV3_A549 | HPIV3 | A549 | 6.225043 | 3.66e-35 |
| IFNL3 | RSV | RSV_A549 | RSV | A549 | 4.827347 | 3.19e-16 |
| IFNL3 | SARS-CoV-2(Calu-3) | SARS-CoV-2_Calu-3 | SARS-CoV-2 | Calu-3 | 7.468010 | 5.51e-102 |
| IFNL3 | SARS-CoV-2(A549) | SARS-CoV-2_A549 | SARS-CoV-2 | A549 | 0.000000 | 1.00e+00 |
| IFNL3 | SARS-CoV-2(A549-ACE2)LowMOI | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 0.000000 | 0.00e+00 |
| IFNL3 | SARS-CoV-2(A549-ACE2)HiMOI | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 5.045848 | 2.37e-18 |
| IFNL3 | SARS-CoV-2(A549-ACE2)-Ruxolitinib | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 2.510746 | 1.00e+00 |
| ... | ... | ... | ... | ... | ... | ... |
| ZYX | SARS-CoV-1 | SARS-CoV-1_MRC5 | SARS-CoV-1 | MRC5 | -0.18373000 | 2.324264e-01 |
| ZYX | IAV | IAV_A549 | IAV | A549 | -0.08268000 | 9.116034e-01 |
| ZYX | MERS-CoV | MERS-CoV_MRC5 | MERS-CoV | MRC5 | -0.01463000 | 9.330340e-01 |
| ZYX | HPIV3 | HPIV3_A549 | HPIV3 | A549 | -0.79141202 | 2.379019e-01 |
| ZYX | RSV | RSV_A549 | RSV | A549 | -0.23437155 | 7.462522e-01 |
| ZYX | SARS-CoV-2(Calu-3) | SARS-CoV-2_Calu-3 | SARS-CoV-2 | Calu-3 | 0.42730000 | 1.435731e-02 |
| ZYX | SARS-CoV-2(A549) | SARS-CoV-2_A549 | SARS-CoV-2 | A549 | 0.20066000 | 2.510894e-02 |
| ZYX | SARS-CoV-2(A549-ACE2)LowMOI | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | -0.85993000 | 2.900000e-06 |
| ZYX | SARS-CoV-2(A549-ACE2)HiMOI | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | -0.42121314 | 4.865060e-04 |
| ZYX | SARS-CoV-2(A549-ACE2)-Ruxolitinib | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | -0.60178189 | 3.080000e-07 |
| ZZEF1 | SARS-CoV-1 | SARS-CoV-1_MRC5 | SARS-CoV-1 | MRC5 | 0.19146000 | 1.130588e-01 |
| ZZEF1 | IAV | IAV_A549 | IAV | A549 | -0.04038000 | 9.543538e-01 |
| ZZEF1 | MERS-CoV | MERS-CoV_MRC5 | MERS-CoV | MRC5 | 0.69079000 | 8.630000e-12 |
| ZZEF1 | HPIV3 | HPIV3_A549 | HPIV3 | A549 | 0.03538851 | 9.539934e-01 |
| ZZEF1 | RSV | RSV_A549 | RSV | A549 | 0.21774106 | 7.303373e-01 |
| ZZEF1 | SARS-CoV-2(Calu-3) | SARS-CoV-2_Calu-3 | SARS-CoV-2 | Calu-3 | 0.11892000 | 6.288487e-01 |
| ZZEF1 | SARS-CoV-2(A549) | SARS-CoV-2_A549 | SARS-CoV-2 | A549 | -0.36100000 | 2.040000e-05 |
| ZZEF1 | SARS-CoV-2(A549-ACE2)LowMOI | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 0.27388000 | 4.836400e-02 |
| ZZEF1 | SARS-CoV-2(A549-ACE2)HiMOI | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 0.77057363 | 2.990000e-20 |
| ZZEF1 | SARS-CoV-2(A549-ACE2)-Ruxolitinib | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 0.84691177 | 6.260000e-27 |
| ZZZ3 | SARS-CoV-1 | SARS-CoV-1_MRC5 | SARS-CoV-1 | MRC5 | 0.52811000 | 1.000000e+00 |
| ZZZ3 | IAV | IAV_A549 | IAV | A549 | 0.22277000 | 7.546104e-01 |
| ZZZ3 | MERS-CoV | MERS-CoV_MRC5 | MERS-CoV | MRC5 | 0.27730000 | 1.000000e+00 |
| ZZZ3 | HPIV3 | HPIV3_A549 | HPIV3 | A549 | 0.67102547 | 3.948680e-01 |
| ZZZ3 | RSV | RSV_A549 | RSV | A549 | 0.50359494 | 5.614496e-01 |
| ZZZ3 | SARS-CoV-2(Calu-3) | SARS-CoV-2_Calu-3 | SARS-CoV-2 | Calu-3 | 0.54586000 | 1.678310e-04 |
| ZZZ3 | SARS-CoV-2(A549) | SARS-CoV-2_A549 | SARS-CoV-2 | A549 | 0.64781000 | 5.870000e-12 |
| ZZZ3 | SARS-CoV-2(A549-ACE2)LowMOI | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 0.21954000 | 3.634203e-01 |
| ZZZ3 | SARS-CoV-2(A549-ACE2)HiMOI | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 0.56665763 | 3.990000e-07 |
| ZZZ3 | SARS-CoV-2(A549-ACE2)-Ruxolitinib | SARS-CoV-2_A549-ACE2 | SARS-CoV-2 | A549-ACE2 | 0.58759967 | 1.220000e-07 |
After all of our data wrangling, you can see we have a data frame with nearly 1/4 million observations. This spans across 10 experimental conditions. From a broad level, we can look across entire data sets to identify genes of interest. There are a couple of ways to do this and we'll begin with a standard set of plots that can convey the overall distribution of our data. While our view of the data is low-resolution we are still able to compare specific gene information centered around mean expression levels, magnitude of comparison, and statistical significance.
Used originally in the application of Microarray data analysis, these plots are also applied to visualizing high-throughput sequencing analysis.
This is a data transformation between two sets of data such that
M represents the log$_{2}$ ratio of fold differences between your data setsA represents the average expression signal We plot our M values along the y-axis against the corresponding A values on the x-axis. Do these characteristics sound familiar? They are standard output for DESeq2 data sets! Unfortunately our above data doesn't have that information anymore, but we have a dataset that does! It's another dataset from an earlier pre-publication copy of the Blanco-Melo et al., 2020 paper: Blanco-Melo2020_Supp_Data_4.xlsx.
Let's open this up and take a look!
# We'll be looking at DE data for IAV infection of A549 cells
DE_A549_IAV.df <- read_excel("./data/Blanco-Melo2020_Supp_Data_4.xlsx", sheet=2)
# Look at the data frame we've made
head(DE_A549_IAV.df)
| GeneName | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | status |
|---|---|---|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| A1BG | 188.2808108 | 0.62513831855045399 | 0.35066646804794899 | 1.7827148459058599 | 7.4632733430825202E-2 | 0.36098583186238298 | OK |
| A1BG-AS1 | 71.8509755 | -0.271011828295395 | 0.51547397821110097 | -0.52575268539434905 | 0.59906005266250995 | 0.86261497584579405 | OK |
| A1CF | 36.1544502 | -1.4109515085643001 | 0.63504512806976798 | -2.22181298020964 | 2.6295947148654599E-2 | 0.20375590069103 | OK |
| A2M | 0.6370128 | 3.6682361685538201 | 4.7612924012057203 | 0.77042866924637798 | 0.44104565255438499 | NaN | Low |
| A2M-AS1 | 15.9095151 | -1.1067414911928199 | 0.92353059087077005 | -1.1983809763673401 | 0.23076873499770401 | 0.60966094504213497 | OK |
| A2ML1 | 0.0000000 | NaN | NaN | NaN | NaN | NaN | Low |
across()¶Okay, so we can't use the data right away. The read_excel() command has decided to convert some of our columns to character format because it detected NaN values. So we'll have to quickly coerce the data before moving on. We can quickly convert the data with a mutate() command using the helper verb across().
To successfully use this helper, we will want to list out the columns we want to use in the .cols parameter, and then apply a function like as.numeric. We don't actually need all of these columns, but it's good practice!
# Convert our numeric columns to the proper type.
DE_A549_IAV.df <-
DE_A549_IAV.df %>%
# Use mutate() with across() to convert columns to numeric
mutate(across(.cols = c(3:7), as.numeric))
# Check the result
str(DE_A549_IAV.df)
tibble [27,914 x 8] (S3: tbl_df/tbl/data.frame) $ GeneName : chr [1:27914] "A1BG" "A1BG-AS1" "A1CF" "A2M" ... $ baseMean : num [1:27914] 188.281 71.851 36.154 0.637 15.91 ... $ log2FoldChange: num [1:27914] 0.625 -0.271 -1.411 3.668 -1.107 ... $ lfcSE : num [1:27914] 0.351 0.515 0.635 4.761 0.924 ... $ stat : num [1:27914] 1.783 -0.526 -2.222 0.77 -1.198 ... $ pvalue : num [1:27914] 0.0746 0.5991 0.0263 0.441 0.2308 ... $ padj : num [1:27914] 0.361 0.863 0.204 NaN 0.61 ... $ status : chr [1:27914] "OK" "OK" "OK" "Low" ...
ggplot2¶Given that we already have our data in a differential expression format, we can just use ggplot2 to plot the correct variables to the x and y axes. In this case, we are interested in using the log2FoldChange as our M and baseMean represents our A. There are a number of ways we can colour this plot but we'll take into account our padj values to highlight those genes with DE values that are statistically significant.
Now, if we were using raw or normalized expression data, we would need to calculate the proper values for the transformation of M and A. There are a few packages that can take care of this for you such as ggmaplot which you can find here
For our plot, to differentiate between our points, let's use the gghighlight package from last week! What will our predicates be based on? Log2 fold change? Adjusted p-value?
# Adjust our plot window size according to the expected output
options(repr.plot.width=10, repr.plot.height=10)
# MA plot! Need to select based on a fold change > x and padj <= fdr
# Usually FC = 1.5, fdr = 0.05 but we'll use the "status" variable here instead.
DE_A549_IAV.df %>%
filter(status == "OK") %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=baseMean, y = log2FoldChange) +
# Theme
theme_bw() +
theme(text = element_text(size=20)) +
# 3. Scaling
# Use a log10 scale on the x axis
scale_x_log10() +
# Set the breaks on our y-axis
scale_y_continuous(limits = c(-10, 10), breaks = seq(-10, 10, 2)) +
# 4. Geoms
# Colour our points all in red
geom_point(colour = "red", alpha = 0.2, size=4, na.rm=TRUE) +
### 2.1.2
# Change it so that only high L2FC with low p-values will be in red, the rest as black
gghighlight(padj < 0.05, abs(log2FoldChange) > 1.5, unhighlighted_params = list(colour = "black"))
Warning message: "Using `across()` in `filter()` is deprecated, use `if_any()` or `if_all()`."
So our plot shows us the general distribution of the DE data. Looking at it, we can see, notably, that it appears that we have many more genes upregulated than downregulated in comparison to our control. We also see, as expected, more low-probability variation occuring with lower normalized mean counts. Conversely, as our overall mean counts increase, we also see less variation in general but we do see a gene that does pass threshold DE with a low enough p-value! To investigate further you could filter the data or look again with a volcano plot!
While the MA plots explored data based on its fold-change values in relation to mean expression, the volcano plot looks purely at differential expression based on it's p-value (ie the probability that the DE is real). To emphasize the importance of the p-value we will -log transform it so smaller p-values are higher up on the y-axis. This quickly partitions data points in the visualization to draw high-DE/low p-value combinations away from the bulk of the population.
To plot this data, we'll use our larger blanco_data_long.df which contains data from multiple experiments. For now, let's focus on just the SARS-CoV-2 data that has been generated in the A549 host cells. Our -log transformation of the p-values will also be problematic when encountering 0, so be careful! Let's compare plots with and without these problem values.
To label our points of interest, we'll also use the geom_text_repel() from the ggrepel package. This will help arrange and plot text onto our scatterplot in a way that avoids collisions with the data and the other labels! You could also experiment with using the directlabels package that we played around with last week to accomplish a similar feat.
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=10)
# Remove p-values = 0
volcano.plot1 <-
# Volcano plot!
blanco_data_long.df %>%
filter(pathogen == "SARS-CoV-2", host=="A549", padj !=0) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=L2FC, y = -log10(padj)) +
# Themes
theme_bw()+
theme(text = element_text(size=20)) +
labs(title = "Volcano plot with p-values == 0 removed") +
# 4. Geoms
# points all coloured red
geom_point(colour = "red", alpha = 0.7, size=4) +
# highlight leaves only the ones meeting our criteria as red, colour rest as blue
gghighlight(abs(L2FC) > 2, padj < 0.05, unhighlighted_params = list(colour = "lightblue")) +
### 2.2.0
# Add labels but only for the top 10 genes
geom_text_repel(data = blanco_data_long.df %>%
# Filter the data used to label
filter(pathogen == "SARS-CoV-2", host=="A549", abs(L2FC) > 2, padj !=0) %>%
# Arrange in ascending order
arrange(padj) %>%
# Take the first 10 results
dplyr::slice(1:10),
aes(label = GeneName), size=6)
Warning message: "Using `across()` in `filter()` is deprecated, use `if_any()` or `if_all()`."
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=10)
# leave in p-values = 0
volcano.plot2 <-
# Volcano plot!
blanco_data_long.df %>%
filter(pathogen == "SARS-CoV-2", host=="A549") %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=L2FC, y = -log10(padj)) +
# Themes
theme_bw()+
theme(text = element_text(size=20)) +
labs(title = "Volcano plot with all p-values") +
# 4. Geoms
# points all coloured red
geom_point(colour = "red", alpha = 0.7, size=4) +
# highlight leaves only the ones meeting our criteria as red, colour rest as blue
gghighlight(abs(L2FC) > 2, padj < 0.05, unhighlighted_params = list(colour = "lightblue")) +
# Add labels but only for the top 10 genes
geom_text_repel(data = blanco_data_long.df %>%
# Filter the data used to label
filter(pathogen == "SARS-CoV-2", host=="A549", abs(L2FC) > 2) %>%
# Arrange in ascending order
arrange(padj) %>%
# Take the first 10 results
dplyr::slice(1:10),
aes(label = GeneName), size=6)
Warning message: "Using `across()` in `filter()` is deprecated, use `if_any()` or `if_all()`."
# Plot both of our figures together
fig.show="hold"; out.width="50%"
volcano.plot1
volcano.plot2
Looking at our volcano plots, we can clearly see the volcano shape we are looking for. The data is split between our $\pm$ log$_{2}$ fold changes with points highlighted when we see DE beyond this. You can customize your parameters but you can also see the emphasis on lower p-values in our data set. The most "significant" data is found in the upper left and right quadrants of the graph, which wasn't made clear in the MA plot. You could use the same tricks to label your MA plot too but the data could end up anywhere along the MA plot.
In the end, if you do have access to the original counts, it may be of use to consider this information when further investigating your candidate genes from DE.
Now that we've taken an overall look at our data, we can begin to assess our data based on some candidate genes or hypothesis testing. For instance, you may only be interested in looking at genes with an L2FC > 3. In other cases, you may be interested in a group of genes pertaining to a function like the chemokines which are part of the inflammation response.
After selecting a subset of genes we can begin to compare their DE data more meaningfully across different sample sets. Let's continue working with our long-format dataset blanco_data_long.df which contains our multiple RNAseq experiments from various infection conditions. We will begin by generating a list of the highest DE genes with low p-values in the SARS-CoV-2 infection scenario.
# Select the genes we want to visualize
heatmap_gene.list <-
blanco_data_long.df %>%
# filter for data from a specific experiment, and then by high positive L2FC values with low p-values
filter(pathogen == "SARS-CoV-2", host == "A549", L2FC > 3, padj < 0.001) %>%
dplyr::select(GeneName) %>% unlist() %>% unname()
# Take a look at the resulting list
heatmap_gene.list
Now that we have generated a select group of genes to examine, our DE data is well-suited to visualizing in heatmap or heat plot. In our heatplot we can generate values along two categorical axes. On the y-axis we can plot DE values from our individual genes, while on the x-axis we will show values across multiple experiments. To visualize the data itself, we will use the geom_tile() command.
# Adjust our plot window size according to the expected output
options(repr.plot.width=12, repr.plot.height=10)
# heatmap of our data based on a list of top DE genes from the SARS-CoV-2 infection of A549 cells
blanco_data_long.df %>%
filter(GeneName %in% heatmap_gene.list,
host == "A549"
) %>%
# mutate(GeneName = factor(GeneName, levels = heatmap_gene.list)) %>%
# 1. Data
ggplot(.) +
#2. Aesthetics
### 3.1.0 set aesthetics
aes(x=pathogen, y = GeneName, fill=L2FC) +
# Theme
theme_bw()+
theme(text = element_text(size=20)
#axis.text.x = element_text(angle=45, hjust = 1)
)+
# 3. Scaling
# Keep this to reverse the y axis or put it into the assignment?
scale_y_discrete(limits = rev) +
scale_fill_viridis_c(option="plasma") +
# Geoms
### 3.1.0 Use a tile and set the width a little smaller to give a gap between groups
geom_tile(width = 0.95)
# Adjust our plot window size according to the expected output
options(repr.plot.width=12, repr.plot.height=14)
# heatmap of our data based on a list of top DE genes from the SARS-CoV-2 infection of A549 cells
# Split x-axis by experiment name
blanco_data_long.df %>%
filter(GeneName %in% heatmap_gene.list,
...
) %>%
# 1. Data
ggplot(.) +
#2. Aesthetics
aes(x=experiment, y = GeneName, fill=L2FC) +
# Theme
theme_bw()+
theme(text = element_text(size=20),
axis.text.x = element_text(angle=45, hjust = 1)
) +
# 3. Scaling
# Keep this to reverse the y axis or put it into the assignment?
scale_y_discrete(limits = rev) +
scale_fill_viridis_c(option="plasma") +
# Geoms
# Use a tile and set the width a little smaller to give a gap between groups
geom_tile(width = 0.95)
From our list of the top 18 DE hits in our SARS-CoV-2 set, it looks like we see some similar hits across some of the genes when infecting A549 cells with other viruses like HPIV3 and RSV.
We see strong L2FC values for IL6, IL1A and LAMC2 for instance. IL6 (interleukin 6) has been reported to play a role in both mounting an effective immune response to certain viral infections but its interactions with other factors may also have a potential role in the exacerbation of viral phenotypes. In looking across different host cell infections by SARS-CoV-2 there is again consistent upregulation of IL1A and IL6.
Using a similar method, we can instead focus on genes from a specific family or pathway as long as we have a list. Let's try the family of chemokines which play a role in inducing cell movement during the immune reponse.
# Build a list of the potential chemokines by looking for those matching the "CCL" pattern
heatmap_ccl.list <-
blanco_data_long.df %>%
# Filter specifically for any genes that have CCL in their name
filter(...) %>%
dplyr::select(GeneName) %>%
unlist() %>%
unique()
str(heatmap_ccl.list)
# Adjust our plot window size according to the expected output
options(repr.plot.width=12, repr.plot.height=10)
# heatmap of our data based on a list of chemokine genes
blanco_data_long.df %>%
# Filter for genes in our list, and experiments where the host cell was "A549"
filter(GeneName %in% ...,
host == "A549"
) %>%
# 1. Data
ggplot(.) +
#2. Aesthetics
aes(x=experiment, y = GeneName, fill=L2FC) +
# Theme
theme_bw()+
theme(text = element_text(size=20),
axis.text.x = element_text(angle=45, hjust = 1)
)+
# 3. Scaling
scale_fill_viridis_c(option="plasma")+
# Geoms
# Use a tile and set the width a little smaller to give a gap between groups
geom_tile(width = 0.95)
From our heatmap, we can observe that looking only at the chemokines, there is consistent upregulation of CCL5 and CCL2 when infecting our A459 cells with any of these viruses. In comparison, however, HPIV3 and RSV infection show a much higher DE across the CCL family in comparison to SARS-CoV-2 which appears to illicit less inflamatory response in the time-frame sampled.
We can further narrow our search instead of looking at whole gene groups, and pick a small set of specific genes. We'll use a dotplot to visualize the data where the y-axis uses the L2FC value to place points and we'll use a categorical x-axis of gene lists. We'll filter our data to only include infections in the A549 host, and also group our data in a few ways:
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=14)
single_genes = c("CCL5", "CCL2", "CCL20", "CCL17", "IL6", "IL1A")
blanco_data_long.df %>%
filter(GeneName %in% single_genes,
) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
### 3.2.0 set the aesthetics for our dot plot colouring by pathogen, shaping by host
aes(x=GeneName, y = L2FC, colour=..., shape=...) +
# Theme
theme_bw()+
theme(text = element_text(size=20)) +
# 3. Scaling
scale_colour_viridis_d(direction = -1,
guide = guide_legend(override.aes = list(size = 5))) +
# Scale the size range
scale_size(range = c(8, 4)) +
scale_shape(guide = guide_legend(override.aes = list(size = 5))) +
# 4. Geoms
### 3.2.0 adjust the dot size based on the adjusted p-value
geom_point(aes(size = ...), alpha = 0.8, stroke = 2) +
# 6. Facets
facet_wrap(~ host, ncol = 2, scales = "free_y")
Focusing on small sets of genes we can see some trends across our groups. For instance, 3 of our pathogens produce a strong IL6 response as we've seen in our heatmaps. , we see a >2 L2FC in CCL2 as a response to infection by SARS-CoV-2. Actually our weakest IL6 response to SARS-CoV-2 was seen in the ACE-2-expressing A459 hosts so our initial heatmap didn't give us a full picture of what might be happening.
It would be best to further investigate these on a more specific subset of hosts and/or pathogens.
Up until this point, we've been feeling our way through the data, rather undirected. Often within manuscripts you will find deeper analyses of differential expression by examining the collection of functional terms to determine trends or pathways of interest. The Gene Ontology (GO) resource represents a codification of gene function through three domains: cellular component (CC), molecular function (MF), and biological process (BP).
Most, if not all, of the genes in our dataset can be described in some way by a GO term. Once we collect these terms, we can begin to look for meaningful groups that could be considered over-represented (or under-represented) in our dataset. Once we have collected this information we can visualize it similar to our individual gene dotplots.
goseq package to pull down GO terms for your genes¶Using the goseq package we can perform a full GO term analysis on our DE data with an analysis for enriched terms within our set of over- or underexpressed genes. However, in order to begin the process of pulling down go terms, with the goseq package, we also need to generate a named integer vector to split our gene sets into the two relevant categories.
We'll store our categorized set of genes in genes_SARS_CoV_2 where a 1 represents over/underexpressed genes in our set, and 0 represents all remaining genes.
# Set up the gene information we need for the GO-term analysis
# It needs to know which genes meet our FC criteria and which do not
#nrow(blanco_data.df)
# Casting our logical as an integer will create a 0/1 matrix
genes_SARS_CoV_2 <- as.integer(blanco_data.df$`padj_SARS-CoV-2(A549)` < 0.05 & # low p-value
abs(blanco_data.df$`SARS-CoV-2(A549)_L2FC`) > 1.5) # absolute L2FC > 1.5
# We'll name each element in our vector by the gene names
names(...) = blanco_data.df$GeneName
# What is our object?
str(genes_SARS_CoV_2)
# Summarize our vector
table(genes_SARS_CoV_2)
The goseq package is compatible with a number of organisms and gene names for mapping your data to the GO database. To view which organisms are supported, we can use supportedOrganisms() to view. Since the raw RNA-Seq reads were aligned to the human genome using the hg19 assembly (see Blanco-Melo et al., 2020), we'll be working with hg19 as well. We'll use the annotation information provided for the next part of our analysis.
First, however, let's check that hg19 is indeed a supported genome annotation.
# Pull down the supported organisms and look for the human "hg19" genome
supportedOrganisms() %>%
filter(...)
Looking at our table above we can immediately see that there are 3 potential annotations sets to draw from. Each uses a different kind of ID description: Entrez Gene ID, Ensemble gene ID, and Gene Symbol. Which one do we use? This depends on how our data is formatted but let's take a quick look at this screenshot of CCL8 entry from NCBI:
| It's important to know the difference between your gene symbols and various identifiers! |
Knowing what we've seen of the data already, the information we're looking for is in the third row. Using the gene symbols in our data, we can
The second portion of information is required in the generation of a probability weighting function (PWF). This PWF will determine a weighting for each gene which is essentially the probability of a gene being differentially expressed based on its length alone. It's important to generate with information as it can influence our enrichment analysis. Ultimately we use the PWF to help inform the creation of a null distribution in our analysis.
We'll use the nullp() function to generate the PWF for our dataset.
# Generate the PWF for our vector of genes
pwf_SARS_CoV_2 <- nullp(DEgenes = ..., # Our list of DE genes
genome = ..., # The GO annotation we want to use
id = ...) # The id we'll use to generate the data using gene symbols
head(pwf_SARS_CoV_2)
We're now ready to query the GO database for over- and underexpressed terms in our data. We'll use the goseq() command to do this, providing our PWF object as well as the correct database information. The default method used in this process to generate our enrichment analysis is the Wallenius approximation. You can choose to use random sampling but this method can take much longer with little difference in results.
For more information on this process see the goseq package vignette here
# Generate our go enrichment analysis
GO.wall_SARS_CoV_2 <- goseq(pwf = ..., # Provide our pwf object
genome = "hg19", # Set the annotation set to use
id = "geneSymbol", # Which variable will we use to compare with
method = ... # How will you decide on the null distribution
)
# What does the final result look like?
head(GO.wall_SARS_CoV_2)
# What is the structure of our resulting data?
str(GO.wall_SARS_CoV_2)
goseq results¶Looking at the results of our analysis, we are returned a listing of 22,531 unique GO terms across the 3 domains (CC, BP, and MF). We are given two sets of p-values, one each for over-represented and under-represented terms. With each term we also see how many times that term is represented in our DE data vs the total number of times that term may appear in the GO database we queried. As you can see from the first few rows of data, some GO term categories can encompass a large number of genes.
To begin our analysis we can filter the enriched GO terms sets by applying a multiple hypothesis test correction with p.adjust() using the Benjamini-Hochberg method to minimize the false discovery rate.
# Generate an enriched Go term dataset
enriched.GO_SARS_CoV_2 <-
# Adjust the p-values of the over-represented column
# GO.wall_SARS_CoV_2[p.adjust(GO.wall_SARS_CoV_2$over_represented_pvalue, method="BH") < 0.05, ]
GO.wall_SARS_CoV_2 %>%
filter(...)
# How many terms come back from our filtering?
str(enriched.GO_SARS_CoV_2)
# Load our goseq data to look at and visualize
# This is especially helpful since goseq doesn't install on JupyterHub
load("./data/Lecture04.RData")
# What kind of terms do we see in our enrichment?
enriched.GO_SARS_CoV_2$term
Looking at our enriched set of GO terms, we see terms like inflammatory response and response to stress. Now that we have a set of enriched GO terms, we can filter and plot this information as a dotplot with the following characteristics:
Let's look at terms that include the word "response".
# Adjust our plot window size according to the expected output
options(repr.plot.width=12, repr.plot.height=10)
enriched.GO_SARS_CoV_2 %>%
# Generate a percentage value
mutate(percentOfCat = numDEInCat/numInCat,
# Create a dummy category
# pathogen = "SARS-CoV-2",
# Make a set of adjusted p-values
FDR = p.adjust(over_represented_pvalue, method = "BH")
) %>%
# Filter for only terms that include the word "response"
filter(str_detect(term, "response")) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
### 4.2.1 set aesthetics by numDEInCat and FDR
aes(x="SARS-CoV-2", y=..., size=..., colour = ...) +
# Theme
theme_bw() +
theme(text = element_text(size=20)) +
# 3. Scaling
scale_size(range=c(3,10)) +
scale_colour_viridis_c(direction = -1) +
# 4. Geoms
geom_point()
With a little bit of elbow grease, we can generate a data frame with multiple GO enrichment analyses for multiple experiments and now we can compare enrichment across data sets!
# Take a look at our combined data frame
enriched.GO_combined.df
# Load up a combined enrichment analysis
# load("./data/Lecture04.RData")
enriched.GO_combined.df %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
### 4.2.4 Set the new x-axis value and colour aesthetic
aes(x=..., y=term, size=numDEInCat, colour = ...) +
# Theme
theme_bw() +
theme(text = element_text(size=20)) +
# 3. Scaling
scale_size(range=c(3,10)) +
scale_colour_viridis_c(direction = -1) +
# 4. Geoms
geom_point()
Upset plots are an alternative to Venn diagrams that show the intersection of sets, as well as the size of the sets. Additionally, Venn diagrams can be difficult to interpret for greater than 2 or 3 sets. This is a real life figure from BMC Bioinformatics. Sure it looks pretty, but what does the number 24 represent in this picture in terms of A, B, C, D, and E?
| What is the meaning of the value 24 in this diagram? Stare at it long enough and you'll see which group it's in but imagine this was 10 groups? |
The UpSet plot was first published in 2014 and has become a helpful tool for visualizing the intersection between components with datasets. Along with the publication came a package for working with these plots. Known as the UpSetR package, there have since been more projects implementing this kind of visualization. While UpSetR has not been updated in nearly 3 years, a more active package known as ComplexUpset is available.
To some extent, the ComplexUpset has extensibility with ggplot, allowing you to use somewhat familiar syntax to modify these plots. Add to this the ability to stack or include additional visualizations of your datasets distributions or other characteristics, and this is a pretty attractive package to work with.
</br>
ComplexUpset package to visualize overlapping datasets¶Let's see how UpSet plots work practically. We can compare our data from blanco_data_long.df and compare the overlap of DE genes (after filtering or not). To do this in a practical sense, we again need to convert our values to a boolean representation after some more data wrangling. Our data wrangling steps will include:
Let's look at our data structure again.
head(blanco_data_long.df)
Begin by making our short-list of genes to filter by.
# filter for upregulated genes in our SARS-CoV-2/A549 data set and make a list of gene names
SARS_CoV_2.list <-
blanco_data_long.df %>%
# Filter our data
filter(pathogen == "SARS-CoV-2", host == "A549", ...) %>%
# Select just our gene names using a shorthand conversion
.$GeneName
# Less code than the following steps:
# dplyr::select(GeneName) %>% unlist() %>% as.character()
SARS_CoV_2.list
Filter our blanco_data_long.df and pivot to a wide-format containing our logical values for inclusion.
# Generate our upset data
blanco_upset.df <-
blanco_data_long.df %>%
# Filter our data list using our genes of interest
filter(GeneName %in% ...) %>%
# convert our L2FC data to a logical based on a value of >2
mutate(upregulated = ...) %>%
# Select just the gene names, experiments, and the logical variable
dplyr::select(GeneName, experiment, ...) %>%
# Pivot our data wide so that each experiment is now a row, each gene is a column
pivot_wider(names_from = ..., values_from = ...) %>%
as.data.frame()
str(blanco_upset.df)
As you can see from our data transformations, we now have all of our experiments listed in columns, with each gene represented in each row. As we look down each column we see a value of TRUE or FALSE used to differentiate if there was overexpression of that gene in that specific experimental context.
If we had not filtered based on some set of genes, we would have a data frame with more than 23K columns! Now that we have our genes presented in this way we can proceed to generate an upset plot.
Working with the upset() function to build our plot, we will concern ourselves with the following parameters:
data: a dataframe including binary columns that represent membership in each class.intersect: a list of column names which will be used to generate the intersecting data.name: the label shown below the intersection matrix.height_ratio: ratio of the intersection matrix to the intersection height (default = 0.5).width_ratio: ratio of the overall set size width to intersection matrix width (default = 0.3).min_size and max_size: minimal and maximal number of observations for an intersection to be included.min_degree and max_degree: minimal and maximal number of degrees for an intersection to be included.n_intersections: the maximum number of intersections to display based on our min_* or max_* criteria.themes: a parameter used to pass theme changes through the upset_default_themes() or upset_modify_themes() functions.# Adjust our plot window size according to the expected output
options(repr.plot.width=18, repr.plot.height=10)
# Generate our upset plot
upset(data = ..., # Provide our dataset
intersect = ..., # Name the columns we want to analyse
name='Experimental condition', # Set the label below the intersection matrix
width_ratio=0.1, # Make the Set size width a little smaller
min_size = ..., # Require there to be a minimum of 2 members to show an intersection
n_intersections = ..., # Set the max number of intersections we want to plot
themes = upset_default_themes(text = element_text(size = 20)) # Set the plot text size to 20
)
From our upset plot, we can see there are 3 regions.
From our set sizes, it looks like a L2FC value of > 2 was perhaps too stringent for the IAV and MERS-CoV DE data. Since we did filter our genes initially by the best hits in our SARS-COV-2(A549) dataset, we can see that it has the highest frequency group at 31 genes with over expression in just this set. The next largest overlapping set is of 14 genes between the SARS-CoV-2(A549) set and the RSV(A549) DE data.
You can also see from the plot that there are only 9 combinations of datasets with any real overlap. Had we set min_size = 1 instead, the remainder of the grouping combinations would be seen to contain just a single gene in each.
Overall this can make for a simplified interpretation of overlapping genes versus more complicated Venn diagrams!
Today we took a long look at some popular visualizations for expression data sets generated by an experiment like RNAseq. Our analyses focused more on highlighting aspects of the differential expression information itself from a big-picture low-resolution view like volcano plots to zooming down to the individual gene level with dot plots.
Next week we'll look at examples of "big" data from an even broader sense by using dimensional reduction techniques like classifying our datasets into groups with principle component analysis.
This week's assignment will be found under the current lecture folder under the "assignment" subfolder. It will include a Jupyter notebook that you will use to produce the code and answers for this week's assignment. Please provide answers in markdown or code cells that immediately follow each question section.
| Assignment breakdown | ||
|---|---|---|
| Code | 50% | - Does it follow best practices? |
| - Does it make good use of available packages? | ||
| - Was data prepared properly | ||
| Answers and Output | 50% | - Is output based on the correct dataset? |
| - Are groupings appropriate | ||
| - Are correct titles/axes/legends correct? | ||
| - Is interpretation of the graphs correct? |
Since coding styles and solutions can differ, students are encouraged to use best practices. Assignments may be rewarded for well-coded or elegant solutions.
You can save and download the Jupyter notebook in its native format. Submit this file to the the appropriate assignment section by 12pm on the date of our next class: April 7th, 2022.
Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Sankey diagram documentation: https://www.rdocumentation.org/packages/networkD3/versions/0.4/topics/sankeyNetwork
riverplot, another package for making your Sankey diagrams: https://cran.r-project.org/web/packages/riverplot/riverplot.pdf
MA plots directly from two sets of expression data: https://rpkgs.datanovia.com/ggpubr/reference/ggmaplot.html
Generating goseq data for RNA analysis: https://bioconductor.org/packages/devel/bioc/vignettes/goseq/inst/doc/goseq.pdf